KAFKA-20115: Group coordinator fails to unload metadata when no longer leader or follower by brandboat · Pull Request #21396 · apache/kafka

brandboat · 2026-02-03T16:34:44Z

When a broker loses leadership of a __consumer_offsets partition while a
write batch is pending, the coordinator unload process fails because
freeCurrentBatch() attempts to access the partition writer configuration,
which throws a NOT_LEADER_OR_FOLLOWER exception.

This commit fixes the issue by skipping buffer release during unload
since all related resources are being closed anyway.
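
The failure mode described above can be sketched in a few lines. This is a simplified illustration, not the actual CoordinatorRuntime code: `UnloadSketch`, `PartitionWriter`, `Context`, `maxBatchSize()`, and the buffer pool are hypothetical stand-ins, and it assumes that returning a buffer to the pool is the step that consults the writer configuration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified stand-ins; these are NOT the real Kafka classes.
final class UnloadSketch {

    /** Stand-in for the error raised once the broker no longer hosts the partition. */
    static final class NotLeaderOrFollowerException extends RuntimeException {
        NotLeaderOrFollowerException(String message) {
            super(message);
        }
    }

    /** Stand-in for the partition writer: config lookups fail after leadership is lost. */
    static final class PartitionWriter {
        private boolean leader = true;

        void loseLeadership() {
            leader = false;
        }

        int maxBatchSize() {
            if (!leader) {
                throw new NotLeaderOrFollowerException("no longer leader or follower");
            }
            return 1024;
        }
    }

    /** Stand-in for the per-partition coordinator context that owns the pending batch. */
    static final class Context {
        private final PartitionWriter writer;
        private final Deque<ByteBuffer> bufferPool = new ArrayDeque<>();
        private ByteBuffer currentBatch;
        boolean unloaded = false;

        Context(PartitionWriter writer) {
            this.writer = writer;
        }

        void startBatch() {
            currentBatch = ByteBuffer.allocate(writer.maxBatchSize());
        }

        // Returning the buffer to the pool consults the writer config (assumed here),
        // which is the call that throws after leadership is lost.
        private void freeCurrentBatch() {
            if (currentBatch.capacity() <= writer.maxBatchSize()) {
                currentBatch.clear();
                bufferPool.push(currentBatch);
            }
            currentBatch = null;
        }

        void failCurrentBatch(boolean freeBuffer) {
            if (currentBatch == null) {
                return;
            }
            if (freeBuffer) {
                freeCurrentBatch();
            } else {
                // Drop the reference only; the pool is discarded with the context anyway.
                currentBatch = null;
            }
        }

        void unload(boolean freeBuffer) {
            failCurrentBatch(freeBuffer);
            unloaded = true; // only reached if failing the batch did not throw
        }
    }

    /** Returns true if the unload ran to completion after leadership was lost. */
    static boolean unloadAfterLeadershipLoss(boolean freeBuffer) {
        PartitionWriter writer = new PartitionWriter();
        Context context = new Context(writer);
        context.startBatch();
        writer.loseLeadership();
        try {
            context.unload(freeBuffer);
        } catch (NotLeaderOrFollowerException e) {
            // The unload was aborted part-way through.
        }
        return context.unloaded;
    }

    public static void main(String[] args) {
        System.out.println("release buffer during unload: unloaded = " + unloadAfterLeadershipLoss(true));
        System.out.println("skip buffer release:          unloaded = " + unloadAfterLeadershipLoss(false));
    }
}
```

In this sketch the first call leaves the context not unloaded because the config lookup throws before the unload finishes, while the second call completes. That mirrors the reasoning of the fix: the buffer does not need to go back to a pool that is about to be closed.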

…r leader or follower Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

dajac · 2026-02-03T17:43:17Z

cc @clolov This is a potential blocker for 4.2.

chia7712 · 2026-02-03T17:53:13Z

> cc @clolov This is a potential blocker for 4.2.

I gave him the heads-up about the bad news offline. Hopefully, he's still my friend after this. 😢

squah-confluent

Thank you for the fix!
Some initial thoughts:

squah-confluent · 2026-02-03T18:28:34Z

...common/src/test/java/org/apache/kafka/coordinator/common/runtime/CoordinatorRuntimeTest.java

    }

+    @Test
+    public void testScheduleUnloadingWithPendingBatchWhenPartitionWriterConfigThrows() {


Could we move this together with the rest of the testScheduleUnloading* tests?

squah-confluent · 2026-02-03T18:37:30Z

...common/src/test/java/org/apache/kafka/coordinator/common/runtime/CoordinatorRuntimeTest.java

+            new CoordinatorRuntime.Builder<MockCoordinatorShard, String>()
+                .withTime(timer.time())
+                .withTimer(timer)
+                .withDefaultWriteTimeOut(Duration.ofMillis(20))


nit: Could we use DEFAULT_WRITE_TIMEOUT here unless there's a reason to use a different timeout?

Suggested change

-                .withDefaultWriteTimeOut(Duration.ofMillis(20))
+                .withDefaultWriteTimeOut(DEFAULT_WRITE_TIMEOUT)

Nice catch, thanks!

Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

chia7712 · 2026-02-04T01:09:58Z

...common/src/test/java/org/apache/kafka/coordinator/common/runtime/CoordinatorRuntimeTest.java

    }

+    @Test
+    public void testScheduleUnloadingWithPendingBatchWhenPartitionWriterConfigThrows() {


I'm thinking about adding integration tests. For example:

    @ClusterTest(
        brokers = 2,
        types = {Type.KRAFT},
        serverProperties = {
            @ClusterConfigProperty(key = GroupCoordinatorConfig.OFFSETS_TOPIC_PARTITIONS_CONFIG, value = "1"),
            @ClusterConfigProperty(key = "group.coordinator.append.linger.ms", value = "3000")
        }
    )
    public void test(ClusterInstance clusterInstance) throws InterruptedException, ExecutionException, TimeoutException {
        try (var producer = clusterInstance.<byte[], byte[]>producer()) {
            producer.send(new ProducerRecord<>("topic", "value".getBytes(StandardCharsets.UTF_8)));
        }
        try (var admin = clusterInstance.admin()) {
            admin.createTopics(List.of(new NewTopic(Topic.GROUP_METADATA_TOPIC_NAME, Map.of(0, List.of(0))))).all().get();
        }
        try (var consumer = clusterInstance.consumer(Map.of(ConsumerConfig.GROUP_ID_CONFIG, "test-group"));
             var admin = clusterInstance.admin()) {
            consumer.subscribe(List.of("topic"));
            while (consumer.poll(Duration.ofMillis(100)).isEmpty()) {
                // empty body
            }
            // append records to coordinator
            consumer.commitSync();
            // unload the coordinator by changing leader (0 -> 1)
            admin.alterPartitionReassignments(Map.of(new TopicPartition(Topic.GROUP_METADATA_TOPIC_NAME, 0),
                Optional.of(new NewPartitionReassignment(List.of(1))))).all().get();
        }

        Function<GroupCoordinator, List<TopicPartition>> partitionsInGroupMetrics = service -> assertDoesNotThrow(() -> {
            var f0 = GroupCoordinatorService.class.getDeclaredField("groupCoordinatorMetrics");
            f0.setAccessible(true);
            var f1 = GroupCoordinatorMetrics.class.getDeclaredField("shards");
            f1.setAccessible(true);
            return List.copyOf(((Map<TopicPartition, ?>) f1.get(f0.get(service))).keySet());
        });

        // the offset partition should NOT be hosted by multiple coordinators
        var tps = clusterInstance.brokers().values().stream()
            .flatMap(b -> partitionsInGroupMetrics.apply(b.groupCoordinator()).stream()).toList();
        assertEquals(1, tps.size());
    }

WDYT?

Sure, why not? Thanks for the thorough integration test!

Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

chia7712 · 2026-02-04T15:46:54Z

...tor-common/src/main/java/org/apache/kafka/coordinator/common/runtime/CoordinatorRuntime.java

            deferredEventQueue.failAll(Errors.NOT_COORDINATOR.exception());
-            failCurrentBatch(Errors.NOT_COORDINATOR.exception());
+            // There is no need to free the current batch, as we will be closing all related resources anyway.
+            failCurrentBatch(Errors.NOT_COORDINATOR.exception(), false);


Would you mind adding a more readable method? For example:

    private void failCurrentBatchWithoutRelease(Throwable t) {
        failCurrentBatch(t, false);
    }

chia7712 · 2026-02-04T15:48:03Z

...tegration-tests/src/test/java/org/apache/kafka/clients/consumer/ConsumerIntegrationTest.java

+            @ClusterConfigProperty(key = GroupCoordinatorConfig.GROUP_COORDINATOR_APPEND_LINGER_MS_CONFIG, value = "3000")
+        }
+    )
+    public void testSingleCoordinatorOwnershipAfterPartitionReassignment(ClusterInstance clusterInstance) throws InterruptedException, ExecutionException, TimeoutException {


@brandboat could you open a PR against trunk to improve the test coverage?

Here you go: #21403

Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

KAFKA-20115: Group coordinator fails to unload metadata when no longe…

31f5988

…r leader or follower Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

lianetm added ci-approved group-coordinator labels Feb 3, 2026

squah-confluent reviewed Feb 3, 2026

Address comments

83e16f7

Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

chia7712 reviewed Feb 4, 2026

Add integration test

99d4855

Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

chia7712 reviewed Feb 4, 2026

Address comments

9d3874a

Signed-off-by: Kuan-Po Tseng <brandboat@gmail.com>

brandboat wants to merge 4 commits into apache:4.2 from brandboat:KAFKA-20115
